The movie database contains information about 10,000 movies collected from The Movie Database (TMDb), including user ratings and revenue. The final two columns ending with “_adj” show the budget and revenue of the associated movie in terms of 2010 dollars, accounting for inflation over time.
- Question 1: How did the production rate change across years?
- Question 2: What is the production rate for top 5 genres overall each genre each year and what is the most dominant genres?
- Question 3: What is the net profit for top 5 genres each year?
- Question 4: Who are the top 10 most successful directors when it comes to net profit?
- Question 5: Is there is a correlation between average rating and net profit?
- Question 6: Who are the top 10 most diverse directors when it comes to genres?
# importing the packages I am intending to use.
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "notebook"
# read my csv file into Dataframe
df = pd.read_csv("tmdb-movies.csv")
# show the head of the df to make sure it's read correctly
df.head(2)
| id | imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | ... | overview | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | http://www.jurassicworld.com/ | Colin Trevorrow | The park is open. | ... | Twenty-two years after the events of Jurassic ... | 124 | Action|Adventure|Science Fiction|Thriller | Universal Studios|Amblin Entertainment|Legenda... | 6/9/15 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 |
| 1 | 76341 | tt1392190 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... | http://www.madmaxmovie.com/ | George Miller | What a Lovely Day. | ... | An apocalyptic story set in the furthest reach... | 120 | Action|Adventure|Science Fiction|Thriller | Village Roadshow Pictures|Kennedy Miller Produ... | 5/13/15 | 6185 | 7.1 | 2015 | 1.379999e+08 | 3.481613e+08 |
2 rows × 21 columns
# checking the names of the columns in the dataset
df.columns
Index(['id', 'imdb_id', 'popularity', 'budget', 'revenue', 'original_title',
'cast', 'homepage', 'director', 'tagline', 'keywords', 'overview',
'runtime', 'genres', 'production_companies', 'release_date',
'vote_count', 'vote_average', 'release_year', 'budget_adj',
'revenue_adj'],
dtype='object')
# dopping list of columns inplace
columns_to_be_dropped = ['id', 'imdb_id', 'budget', 'keywords', 'revenue', 'homepage', "overview", 'tagline', 'release_date']
df.drop(columns_to_be_dropped, axis = 1, inplace=True)
df.head(1)
| popularity | original_title | cast | director | runtime | genres | production_companies | vote_count | vote_average | release_year | budget_adj | revenue_adj | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32.985763 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | Colin Trevorrow | 124 | Action|Adventure|Science Fiction|Thriller | Universal Studios|Amblin Entertainment|Legenda... | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 |
# find numnber of duplicates in the DF
df.duplicated().sum()
1
#drop duplicates from the data frame
df.drop_duplicates(inplace=True)
# make sure duplicates are dropped
df.duplicated().sum()
0
#check for null values
df.isnull().sum()
popularity 0 original_title 0 cast 76 director 44 runtime 0 genres 23 production_companies 1030 vote_count 0 vote_average 0 release_year 0 budget_adj 0 revenue_adj 0 dtype: int64
# drop rows with null values
df.dropna(inplace=True)
# check for null values
df.isnull().sum().any()
False
Adding new column for the net profit
df["net_profit"] = df["revenue_adj"] - df["budget_adj"]
df.head(1)
| popularity | original_title | cast | director | runtime | genres | production_companies | vote_count | vote_average | release_year | budget_adj | revenue_adj | net_profit | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32.985763 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | Colin Trevorrow | 124 | Action|Adventure|Science Fiction|Thriller | Universal Studios|Amblin Entertainment|Legenda... | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 | 1.254446e+09 |
#checking for data types
df.dtypes
popularity float64 original_title object cast object director object runtime int64 genres object production_companies object vote_count int64 vote_average float64 release_year int64 budget_adj float64 revenue_adj float64 net_profit float64 dtype: object
Data types don't need any fixing.
Defining a function
Defining a function that take data frame column name we want to split into multiple columns\ and returns the maximum number of columns we are splitting on
# defining a function take df.column as argument
def get_max(column_name):
# initializing max number
max_num = 0
#iterating over each entry in column
for name in column_name:
# checkin the length of the spilited lis
x = len(name.split("|"))
# if length bigger than max then max = lenght and keeps iterating
if x > max_num:
max_num = x
# returns maximum number
return max_num
Defining a function
Defining a function that take string of column name and max number of columns we want to split on\ and returns a list of names of the new columns
# defining a function takes a string and maximum number
def naming_columns(column_name, max_num):
#initializing a list
name_list = []
#iterating over range of maximum number
for i in range(1, max_num + 1):
#creating a string of the name of the new column and append it to the list
name_list.append(column_name + "_" + str(i))
#return the list
return name_list
Splitting the columns cast , genres and production_companies
I will split each of the columns into multiple columns with new names using .str.split method
# initializing a variable for the column cast
cast = df['cast']
#splitting the names separated by | into new columns
df[naming_columns("cast", get_max(df.cast))] = cast.str.split('|',expand=True )
# initializing a variable for the column genres
genres = df['genres']
#splitting the names separated by | into new columns
df[naming_columns("genre", get_max(df.genres))] = genres.str.split('|',expand=True )
# initializing a variable for the column production_companies
p_company = df['production_companies']
#splitting the names separated by | into new columns
df[naming_columns("production_company", get_max(df.production_companies))] = p_company.str.split('|',expand=True )
df.head(2)
| popularity | original_title | cast | director | runtime | genres | production_companies | vote_count | vote_average | release_year | ... | genre_1 | genre_2 | genre_3 | genre_4 | genre_5 | production_company_1 | production_company_2 | production_company_3 | production_company_4 | production_company_5 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32.985763 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | Colin Trevorrow | 124 | Action|Adventure|Science Fiction|Thriller | Universal Studios|Amblin Entertainment|Legenda... | 5562 | 6.5 | 2015 | ... | Action | Adventure | Science Fiction | Thriller | None | Universal Studios | Amblin Entertainment | Legendary Pictures | Fuji Television Network | Dentsu |
| 1 | 28.419936 | Mad Max: Fury Road | Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... | George Miller | 120 | Action|Adventure|Science Fiction|Thriller | Village Roadshow Pictures|Kennedy Miller Produ... | 6185 | 7.1 | 2015 | ... | Action | Adventure | Science Fiction | Thriller | None | Village Roadshow Pictures | Kennedy Miller Productions | None | None | None |
2 rows × 28 columns
Dropping the columns cast , genres and production_companies
#dropping columns already splitted
df.drop(["cast", "genres", "production_companies"], axis = 1, inplace=True)
df.head(1)
| popularity | original_title | director | runtime | vote_count | vote_average | release_year | budget_adj | revenue_adj | net_profit | ... | genre_1 | genre_2 | genre_3 | genre_4 | genre_5 | production_company_1 | production_company_2 | production_company_3 | production_company_4 | production_company_5 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32.985763 | Jurassic World | Colin Trevorrow | 124 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 | 1.254446e+09 | ... | Action | Adventure | Science Fiction | Thriller | None | Universal Studios | Amblin Entertainment | Legendary Pictures | Fuji Television Network | Dentsu |
1 rows × 25 columns
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 9772 entries, 0 to 10865 Data columns (total 25 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 popularity 9772 non-null float64 1 original_title 9772 non-null object 2 director 9772 non-null object 3 runtime 9772 non-null int64 4 vote_count 9772 non-null int64 5 vote_average 9772 non-null float64 6 release_year 9772 non-null int64 7 budget_adj 9772 non-null float64 8 revenue_adj 9772 non-null float64 9 net_profit 9772 non-null float64 10 cast_1 9772 non-null object 11 cast_2 9688 non-null object 12 cast_3 9635 non-null object 13 cast_4 9566 non-null object 14 cast_5 9359 non-null object 15 genre_1 9772 non-null object 16 genre_2 7813 non-null object 17 genre_3 4710 non-null object 18 genre_4 1848 non-null object 19 genre_5 507 non-null object 20 production_company_1 9772 non-null object 21 production_company_2 6385 non-null object 22 production_company_3 3812 non-null object 23 production_company_4 2050 non-null object 24 production_company_5 1124 non-null object dtypes: float64(5), int64(3), object(17) memory usage: 1.9+ MB
Here we are plotting the the frequency of the movie production over the years.
The plot utilizes hovering so you can hover over the year and see the total number of movies produced in this year.
# plotting a histogram of the frequency of prodcution across the years from 1960 to 2015
fig = px.histogram(df.release_year, x="release_year", title='Production rate across years',
labels={'release_year':'Release Year'}).update_layout(yaxis_title="Frequency")
fig.update_layout(bargap=0.2, xaxis_tickangle=-45)
fig.update_xaxes(dtick=5)
fig.show()
Defining a function
Defining a function that take a dafaframe, column name and columns count\ and returns a frequency count of all the columns combined
# defining a function takes dataframe, a string and count
def freq_by_col(df, col_name, count):
# list of columns names
cols = naming_columns(col_name, count)
# initializing a pandas series of type int
a = pd.Series(dtype='int')
# iterationg over columns in dataframe
for col in cols:
# initializing a temp series and store value counts in it
temp = df[col].value_counts()
# initialize series a and add the temp to it and keep iterating on it until end of columns
a = a.add(temp,fill_value=0)
# convert series to data frame renaming the index with the column name from function argument and renameing the column with count
a = (a.to_frame().rename_axis(col_name).rename(columns={0:'count'}))
return a
Here I am getting the top 5 genres overall
top_5_genres = freq_by_col(df,"genre", 5).sort_values(by="count", ascending = False).head(5).index.values.tolist()
Here we are plotting the total number of movies and the proportional number of movies produced each year for the top 5 genre.
The plot utilizes hovering so you can hover over the year and the genre to see the actual number of movies produced in this year for this specific genre or the percentage of specific genre in specific year.
# group by release year and apply the freq_by_col function to get the value count of genres for each year
genre_per_year = df.groupby("release_year").apply(lambda x: freq_by_col(x,'genre',5))
# reset the index of the data frame
reset_df = genre_per_year.reset_index()
# turn the genres into columns and release_year to indices to be able to plot it
pivotted_df = reset_df.pivot_table(values='count',index='release_year', columns="genre")
# select top 5 genres
pivotted_df = pivotted_df[top_5_genres]
fig = px.bar(pivotted_df, x = pivotted_df.index, y = pivotted_df.columns,
labels={
#change hovering labels
"genre": "Genre", "release_year": "Release Year", "value": "Value"
})
fig.update_layout(
title='Total number of movies per year based on genre',
xaxis_title = "Release Year",
yaxis_title = "Number of movies",
xaxis_tickangle=-45
)
fig.update_xaxes(dtick=5)
fig.show()
# this normalizes the data as it gets the sum of all genres per year and divides each genre count by the sum
pivotted_df = pivotted_df.div(pivotted_df.sum(axis=1), axis=0)
fig_2 = px.bar(pivotted_df, x = pivotted_df.index, y = pivotted_df.columns,
labels={
#change hovering labels
"genre": "Genre", "release_year": "Release Year", "value": "Value"
},
hover_data={'value':':%'})
fig_2.update_layout(
title='Normalized movies proportion per year based on genre',
xaxis_title = "Release Year",
yaxis_title = "Percentage of movie genre",
xaxis_tickangle=-45
)
fig_2.update_xaxes(dtick=5)
fig_2.show()
- We can see that the drama genre is the most dominating genre.
- We can also conclude that the rate of production for each genre almost stayed the same over the years.
Minor issue to be fixed
There is a minor issue where some of the movies has 0 budjet_adj or revenue_adj
so I cleaned the data frame here for the sake of net_profit analysis
# cleaning the data from movies that has revenue or budget = 0
df_cleaned = df.query('budget_adj != 0.0 and revenue_adj != 0.0')
Here we are plotting the net profit of each of the top 5 genres and the proportional net profit of each of the top 5 genres for each year.
The plot utilizes hovering so you can hover over the year and the genre to see the net profit for this year or the percentage of net profit for the genre in specific year.
# group by release year and net profit and apply the freq_by_col function to get the value count of genres for each year
genre_per_year = df_cleaned.groupby(["release_year", "net_profit"]).apply(lambda x: freq_by_col(x,'genre',5))
#reset the index of the data frame
reset_df2 = genre_per_year.reset_index()
# turn the genres into columns and release_year to indices to be able to plot it and net profit to values
pivotted_df2 = reset_df2.pivot_table(values=['net_profit'],index='release_year', columns="genre")
# renaming columns to remove the multiindex layers
column_names = ["Action","Adventure","Animation","Comedy","Crime","Documentary","Drama","Family",
"Fantasy","Foreign","History","Horror","Music","Mystery","Romance","Science Fiction",
"TV Movie","Thriller","War", "Western"]
pivotted_df2.columns = column_names
# select top 5 genres
pivotted_df2 = pivotted_df2[top_5_genres]
fig = px.bar(pivotted_df2, x = pivotted_df2.index, y = pivotted_df2.columns,
labels={
#change hovering labels
"variable": "Genre", "release_year": "Release Year", "value": "Net Profit"
})
fig.update_layout(
title='Total net profit for each genre per year',
xaxis_title = "Year",
yaxis_title = "Net Profit",
xaxis_tickangle=-45
)
fig.update_xaxes(dtick=5)
fig.show()
# this normalizes the data as it gets the sum of all genres per year and divides each genre count by the sum
pivotted_df2 = pivotted_df2.div(pivotted_df2.sum(axis=1), axis=0)
fig_2 = px.bar(pivotted_df2, x = pivotted_df2.index, y = pivotted_df2.columns,
labels={
#change hovering labels
"variable": "Genre", "release_year": "Release Year", "value": "Net Profit"
},
hover_data={'value':':%'})
fig_2.update_layout(
title='Normalized net profit proportion per genre for year',
xaxis_title = "Year",
yaxis_title = "Percentage of net profit",
xaxis_tickangle=-45
)
fig_2.update_xaxes(dtick=5)
fig_2.show()
#make a dataframe where the index is the top 10 directors and the values are the total net profit for each director
most_success_director = df_cleaned.groupby(by="director").sum().net_profit.sort_values(ascending = False).head(10).to_frame()
Here we are plotting the total net profit for top 10 directors.
The plot utilizes hovering so you can to see the total net profit and the director.
fig = px.bar(most_success_director, x = most_success_director.columns, y = most_success_director.index,
orientation='h',
labels={
#change hovering labels
"variable": "Net Profit",
"director": "Director", "value": "Total Net Profit"
})
fig.update_layout(
title='Toatal net profit for top 10 directors',
xaxis_title = "Total Net Profit",
yaxis_title = "Director",
showlegend=False
)
fig.show()
Here we are plotting the relation between average rating and net profit.
The plot utilizes hovering so you can see the net profit and average rating.
fig = px.scatter(df_cleaned, x = "vote_average", y = "net_profit",
trendline="ols",
trendline_color_override="rgb(58, 56, 87)",
labels={
#change hovering labels
"vote_average": "Average Rating", "net_profit": "Net Profit"
})
fig.update_layout(
title='Average rating and net profit relation',
xaxis_title = "Average Rating",
yaxis_title = "Net Profit",
)
fig.update_layout(showlegend=False)
fig.show()
Here we are plotting the top 10 directors when it comes to genres diversity.
The plot utilizes hovering so you can see the genre, genre count, director and total count.
# group by director and apply the freq_by_col function to get the value count of genres for each director
director_genre = df_cleaned.groupby(["director"]).apply(lambda x: freq_by_col(x,'genre',5))
# turn the genres into columns and directors to indices to be able to plot it
pivotted_df_d = director_genre.pivot_table(values=['count'],index='director', columns="genre")
# renaming columns to remove the multiindex layers
pivotted_df_d.columns = column_names
# sorting the data frame with least null values across row to select the most diverse and select top 10 deirectors
pivotted_df_d = pivotted_df_d.loc[pivotted_df_d.isnull().sum(1).sort_values().index].head(10)
# get the x coordinates for text anotation which is total number of movies
x_coordinates = pivotted_df_d.sum(axis=1).tolist()
# add 5 to each element in list
x_coordinates = list(np.asarray(x_coordinates) + 4)
# convert list of ints to list of strings
x_coordinates = list(map(str, x_coordinates))
# get the y coordinates which is the director name
y_coordinates = pivotted_df_d.index.tolist()
# get the values which will be printed
values = list(pivotted_df_d.count(axis=1).values)
# convert list of ints to list of strings
values = list(map(str, values))
fig = px.bar(pivotted_df_d, x = pivotted_df_d.columns, y = pivotted_df_d.index,
orientation='h',
labels={
#change hovering labels
"variable": "Genre", "value": "Count", "director": "Director",
"sum_": "sum_"
})
fig.update_layout(
title='Top 10 most diverse directors when it comes to genres',
xaxis_title = "Movie Count",
yaxis_title = "Director",
)
fig.update_xaxes(dtick=10)
fig.update_yaxes(dtick=1)
#loop over the coordinates of each bar and add the annotation
for x,y,value in zip(x_coordinates, y_coordinates, values):
fig.add_annotation(x=x, y=y, text= value + " genre",showarrow=False)
fig.show()
The limitation I found was that there is a large number of movies that had zero budget and revenue, so I had to remove this movies when I was manipulating anything that comes to profit which limits the data and decreases the confidence regarding the revenue questions.